Question 1: Given some sample data, write a program to answer the following: click here to access the required data set

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of 3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Data import

EDA

The data have records for 30 day window Data has 5000 rows with 100 shop_ids Mean value for "order_amount" is 3145.128000 There is no missing value as no NaN appeared in the describe results. There are no duplicate rows.
1a) Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.
Average Order Value in the dataset is found to be 3145.128. So it appears that the naive calculation of AOV comes from the mean value (average of all the order_amount data) for "order_amount". The aov here is calculated based on the total order amount of each shop, which doesn't give the picture about the average order value of each item. Since each shop sells only one type of item, the computation of average order value would be different.
Calculating aov per shop
As per the given question, assuming each of these shops sells only one model of shoe at a reasonable price, we are trying to identify customer average spending for a sneaker. If we consider the naive calculation, then the price appears higher for shoe purchase. This AOV is calculated by averaging across all 100 shops order amount. However, we should calculate the AOV per shop and then average. The actual order value is 387.7428 by computing the order value of each product in the shop by dividing number of items sold and the total order value.

Anomaly detection

However, I can also find some discrepancy in the data, for the shop_id 78. Here is the explanation for the same:
This clearly indicates that there is a major outlier in the dataset, which is for the shop_id 78. It is highly impossible that the cost of sneakers can be 25k. There are two ways of looking at this: - There would be a mismatch in the data entry ( Example: Dollars entered as cents) - Different currency being used to specify the cost, instead of dollars. Hence there is a need to remove the outlier value to obtain the most accurate average order value for the rest of the shops
In the filtered data, we can see that the data still has an outlier (352) corresponding to the shop_id 42. This doesn't seem to be unrealistic as it can be one of the highly priced sneakers in the list, which is still affordable. Hence we don't remove this from the actual dataset.

Results for the cleaned data

This shows that all the measures of central tendency gives almost the same results for the clean dataset.

Results for the original data

CONCLUSION:

Considering the data with anomalies, one of the best ways to evaluate the average order value is by using the mode or the median method, instead of mean , as it gives better accurate results as shown before
1b) What metric would you report for this dataset?

By following the best practices and taking no findings into consideration, we aren't cleaning the data. Here, metric to be used can be Mode or Median, which looks like a more affordable value than the mean of aov

1c) What is its value?
The metric value in the original dataset by following a new approach is $153
2a) How many orders were shipped by Speedy Express in total?
SELECT count(orderID) FROM orders o inner join shippers s on o.shipperid = s.shipperid where s.ShipperName = "Speedy Express"
ANSWER : 54
2b) What is the last name of the employee with the most orders?
SELECT e.LastName from orders o join employees e on o.employeeID = e.employeeID group by o.employeeID order by count(o.OrderID) desc LIMIT 1
ANSWER : PEACOCK
2c) What product was ordered the most by customers in Germany?
select p.ProductName from OrderDetails od join Orders o on od.OrderID = o.OrderID join Customers c on o.CustomerID = c.CustomerID join Products p on p.ProductID = od.ProductID where c.country = "Germany" group by p.productID order by sum(quantity) desc limit 1
ANSWER : Boston Crab Meat